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ABSTRACT 

Variable selection for a multiple regression model (Noisy Lin- 
ear Perceptron) is studied with a mean field approximation. 
In our Bayesian framework, variable selection is formulated 
as estimation of discrete parameters that indicate a subset 
of the explanatory variables. Then, a mean field approxima- 
tion is introduced for the calculation of the posterior averages 
over the discrete parameters. An application to a real world 
example, Boston housing data, is shown. 
KEYWORDS: Probabilistic and Statistical Meth- 
ods, Learning and Generalization, Mean Field Ap- 
proximation, Model Selection, Multiple Regression, 
Bayes 

1. Mean Field Approximation 

Mean Field Approximation (MFA) is a well-known technique 
in statistical physics. In general, MFA is defined as a col- 
lection of techniques to calculate averages over multivariate 
distributions, which uses the idea of ignoring higher order cor- 
relations and solving self-consistent equations. In this paper, 
we restrict ourselves to a type of MFA, which treat distribu- 
tions with discrete variables, specifically binary variables. 

MFA has a long history in physics. It is, however, a recent 
topic in the area of statistical information processing. It is in 
the late 1980s that the mean field approximation is used in 
the study of Boltzmann machine learning An applica- 

tion in image processing also draws much attention of re- 
searchers in the area and works with cluster versions of MFA 
appeared. Recently, a higher order version of MFA ( TAP 
equation ) is also applied to the decoding of error correct- 
ing codes H and Boltzmann machine learning. A relation 
between TAP equation and belief propagation in Bayesian 
network is discussed in |B|. 

The aim of this paper is to apply the idea of MFA to a vari- 
able selection problem. In our Bayesian treatment, variable 
selection for multiple regression is formulated as estimation of 
discrete parameters that indicate a subset of the explanatory 
variables. Then, a mean field approximation is introduced 
for the calculation of the posterior marginals on the space 
of submodels, each of which is specified by a subset of the 
explanatory variables. 

Although MFA is closely related to the Hopfield-Tank ap- 
proach Q to combinatorial optimization, there is an im- 
portant conceptual difference PL While the Hopfield-Tank 
method is a tool for optimization, MFA is a method of ap- 
proximation of the averages over a distribution. MFA is a 
tool that is useful in the problems where averages over a 

1 In a sense, the relation of MFA to the Hopfield-Tank method 
is similar to that of MCMC to simulated annealing. 



multivariate distribution play important roles. Two impor- 
tant examples of such problems are the calculation of the 
averages over a Gibbs distribution in statistical physics and 
the calculation of the posterior averages in Bayesian statis- 
tics. In the present context, MFA gives posterior averages of 
discrete parameters, each of which gives the weight or "the 
probability of the existence" of the corresponding explana- 
tory variable. These averages take the values between and 
1. At this point, our work is distinguished from other works 
on variable selection with a Hopfield-Tank method, say, Sakai 
& 

2. Generalization in Neural Networks 

Here we overview the issues on the generalization in neural 
network models and present the motivation of our work in 
the context of the interpolation between ridge regression and 
variable selection. 

Generalization from a finite set of examples is one of the most 
important topics in the study of artificial neural network in 
these ten years. Actually, it is a repeat of a major theme 
in statistics — How can we beat overfitting and get a better 
performance of prediction ? For multiple regression mod- 
els (noisy linear perceptrons) and other types of feed-foward 
network models, two extreme approaches are known. 

An extreme is ridge regression or linear weight decay, where 
a penalty function of a quadratic form of synaptic weights 
(regression coefficients) is added to the target function that 
is optimized in the learning process |lo| . Information from 
given data is shared among a number of weights with this 
way of learning. The other extreme is variable selection or 
pruning , where we select a subnetwork (a submodel) with as 
small number of adjustable connections as that is enough to 
represent the essence of the data. Information is concentrated 
on small number of weights with the way of learning. 

It is natural to interpolate these two extremes and develop 
methods, such as a "fuzzy variable selection". A number 
of works in this direction are classified into two categories. 
A way to realize to such a interpolation is the use of gen- 
eralized ridge regression, or, nonlinear weight decay, where 
a nonquadratic penalty function is used as the penalty to 
the magnitude of synaptic weights (regression coefficients) 
[[| [j^, [l3| [ji^, |l2| . Another approach to this problem is the 
introduction of weighted averages over a set of models with 
different architectures (with different sets of explanatory vari- 
ables) , instead of using a single "best" model. Methods of this 
category are most easily formulated in a Bayesian framework, 
where the posterior distribution on a space of submodels are 
treated as well as the posterior distribution on a parameter 
space of each submodel H, |^, [| ^| [l^] . 



Our work belongs to the second category. We are mostly 
interested in the algorithm for the computation of posterior 
marginals in a space of submodels. The introduction of the 
posterior distribution on a set of submodels results in combi- 
natorial explosion of the requirement of computer resources, 
when the submodels are not linearly ordered. The calculation 
with exact enumeration is possible only when the number of 
the variables is small jyj . A potentially powerful tool for this 
problem is Markov Chain Monte Carlo algorithm (MCMC) 
[], [l^, |5[. The Monte Carlo approach is, however, still 
computationally expensive and often suffers from slow con- 
vergence. 

At this point we introduce a Mean Field Approximation. As 
already suggested in the previous section, the essence of our 
idea is the application MFA for calculation of the posterior 
average on the space of submodels. In the next section, wc 
will give a formulation and basic notations for the imple- 
mentation of the idea to a linear network, i.e., a multiple 
regression model. 

3. Variable Selection for a Multiple Regression 
Model 

The likelihood function of a multiple regression model is 

N M 



E L 



a — 1 i — 1 

L({y a }\M) = 



exp(-E L ) 



(1) 



(2) 



(2^2)^/2 

where y a and {xi} a are the ath observation of the response 
variable (the output of the network) and the explanatory 
variables (the inputs of the network) respectably. N is the 
number of the observations and M is the maximum number of 
explanatory variables available in data. We introduce a set of 
indicators {Si}, Si 6 {0, 1} for the description of submodels. 
If and only if Si = 1, the variable Xi appears in the submodel 
specified by {Si}. Using the indicators {Si}, a submodel is 
written as 



1\ Nl 



(3) 



L({y a }\{a i },{S i }) = 



■exp(-E L ) . (4) 



(2^2)^/2 

With this submodel, the maximum likelihood estimator of 
the regression coefficient at is obtained as the solution of the 
corresponding normal equation 



Vx^Xi^i ~t~ ^ ^ VxjXj Sj — Vx^y 



(•») 



when Si = 1. Here, sufficient statistics V y 2 ,V Xiy ,V Xi x ■ are 
defined as V y 2 = J2 a (y a ) 2 , V XiV = J2 a (xfy a ) 2 , V XiXj = 



When we define Si = 2Si — 1 and rearrange the terms, we 
have 

-E L = ^(C + J2HiSi + J2J2 J » SiS ^ ■ (6) 

i i j>i 
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(7) 
(8) 

(9) 



using Si = Si. For later convenience, the indicator Si that 
takes the values of ±1 is introduced. 

To define a Bayesian model, we should specify prior distribu- 
tions. The prior distribution of the indicators {Si} is assumed 
as 

exp(-hJ2iSi) 



Am) = 



(10) 



(exp(h) + exp(-/i)) M ' 
where the hyperparameter h determines the degree of the 
penalty to submodels with larger number of explanatory vari- 
ables. In this paper, we also assume that the prior of the coef- 
ficients {en} is the uniform density in R M . Then the posterior 
distribution P pos ({ai}, {Si}) of {Si}, {ai} is formally defined 
as a Gibbs distribution 



P P os({ai},{Si}) = 



exp(—E pos 

ZpoS 



(11) 



with the energy —E pos — —El — h/,. Si and the normal- 
ization Z pos = £{s } / H»^ a * exp(— _E pos ) 0. Here and here- 
after, £{ S .} denote the summation over 2 M configurations 
of {S t }. 

4. MFA for Bayesian Variable Selection 

Now, we discuss an application of Mean Field Approxima- 
tion to variable selection for multiple regression. First, we 
consider the posterior distribution of {Si} conditioned with 
a given set {en} of regression coefficients, 



Ps({Si}\Ui}) 



cxp(-E pos ) 
Z~ s 



The normalization constant Zs is defined by 
Z s = y exp(-E pos ) . 



£• 

{Si} 



Consider a distribution 

Q{{s z }) = n, 

and the KL-divergence 

D(Q;Ps) = ^Q({^})lo; 

{Si} 



1 + rm • Si 



Q({Si}) 



Ps({Si})\{ai}) 



(12) 
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between Q and Ps- An easy calculation shows that 
D(Q; Ps)= -M C + Ei E,>, J^ imj + £. Hi 
+hJ2i m i + logZs 



- log- 



+ ■ 



- log- 



-}(16) 



2 The integral in the definition of Z pos diverges with the uniform 
prior for {a;}. Apparently, this is not significant in our approxima- 
tion, where a point estimate {a*} of the regression coefficients is 
used. It might cause, however, some difficulties, e.g. in a Bayesian 
interpretation of the strength h of the penalty. 



The values {m*} of {m,} that minimize D(Q; Ps) is a solu- 
tion of the "self-consistent equation" 



tanh 
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From the inequality D(Q;Ps) > 0, the relation logZg > 
log Z* MF is derived, where 

log Z* MF = 

^(c + E. H im * + e, £ j>4 - ftE, < 

.logi^i + i^logi^} . (18) 

The equality is hold only when the distributions Q and Ps 
are identical. The form of the distribution ( |l4j ) suggests that 
the ith component m* of a solution of the equation (^) is 
regarded as an approximation of the averages of the indicator 
S» over the distribution (|l2"|). Thus the posterior probability 
Pi that the variable Xi is contained in a submodel is approx- 
imated by 

rf = ^ ■ (19) 

We can also regard log Zjv/F as an approximation of log Zs . 
The self-consistent equation (^) and the expressions ( |l9| ) 
and (JXs[) are the essence of the MFA in our problem. 

So far, we are working with a given set of regression coeffi- 
cients {en} . In our problem, they are unknown parameters. 
A way to deal with them is to replace them with point esti- 
mates that maximize the likelihood 



l({oi}, a 2 ) = log {&})*({&}) 
{Si} 



(20) 



marginalized over {Si} 1. It is represented by the normaliza- 
tion constant Zs of ([13 ) as 



l{{cn},o 2 ) = log Z s 



M\og(e h + e~ h ) 



-log27ra 2 . (21) 



This marginal likelihood ( pl| ) is approximated by the MFA 



N 2 
— log 2na 



lMF({a,i}, a 2 ) = \ogZ* MF - Mlog(e + e h ) 

(22) 

The values {a* } that maximize ( 22j) is a solution of a "soft 
version" of the normal equation 



V XiXi a* + Vx iXj a*p* = V XiV 



(23) 



The equation (|2^) is, in fact, a equation nonlinear in {a*}, 
because the probability {» ?} is a function of {a* } implicitly 
determined with ( |l9| ) and (PlflPI) ■ In actual implementation, 
the following algorithm is used to solve the set of equations. 

(1) Set the initial values {m® } of {mi} and t := 0. 

3 Although the use of a point estimation of {ai} is not fully 
justified in a Bayesian paradigm and leads to an approximation 
beyond the conventional use of MFA, it greatly simplified the 
situation. 

4 Note that the values {m*} should be considered as a 
function of {a*} in the derivation. However, with the use of 

^fa^ 5 ^ I { m i }={ m *} = 1' ^ e resm t coincides with that of a naive 
calculation. 



(2) Solve the equation of {a^} 



t t t r 



(24) 



with p* = (1 + m*)/2. 
(3) Iterate the equation 

m' +1 := (1 — (5) ■ m* + S ■ tanh 



2(7 2 



(25) 



Here, the variables {Hi},{Jij} are defined by (|^,^) with a,: 
a*. 5 is a constant that satisfies < 5 < 1. 

(4) Set t := t + 1. Check the convergence and return to the 
step (2), if necessary. If convergence is ensured, set p* := 
(1 + m*)/2 and a* := a\ and exit. 

We can also estimate the variance a 2 of the residual with the 
maximization of (^2|). For this purpose, we add a step for the 
estimation of a 2 to the iteration. In our implementation, the 
value of a 2 is updated once per the iteration. 

5. An Example 

In this section, we discuss an application of the proposed 
method to a real world example. We use the dataset taken 
from p. 244 of Belsley et al. M. This dataset, known as 
Boston housing data, is also analyzed by other authors with 
different methods ji], |l^, [l7|. There are thirteen explanatory 
variables (M = 13) and one response variable, the logarithm 
of the median value of owner-occupied homes for each area 
of Boston. The sample size is N = 506. 

Our results on this data are shown in Fig.l and Fig. 2. In 
the experiment, the variance a 2 is estimated from the data. 
Although we observe local optima with some parameters of 
h, they are not serious and the solution with the highest 
frequency at each value of h is plotted in the figures. 
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Fig.l: The results with some typical values of h. The index 
i of a variable is shown in the horizontal axis and the proba- 
bility p* = (1 + m*)/2 is shown in the vertical axis. 
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Fig. 2: The change of p* is shown with h. The number 
used as a symbol indicates the index i of the variable. The 
positions of the symbols are slightly perturbed by noise to 
prevent complete overlap between them. Lines are accurate. 

In Fig. 2, the probability p* takes the value p* ~ 1, for 
i — 1,6,13, even with a larger value of the penalty h. It 
seems consistent with the result in |r7|. For most of the 
other variables, the probability p* shows a sudden jump be- 
tween and 1, or, a simple decay with p* ~ cxp (fc)+rcp(-h) ' 
as the value of h is increased. However, for some i, say i = 4, 
it shows a gradual decay and has nontrivial values between 
and 1 in a range of h. 

6. Future Problems 

In the present treatment, the strength h of the penalty to 
the number of variables is a free hyperparameter. It is not a 
serious fault when we use the proposed method as a tool for 
regression diagnostics , i.e., we observe the behavior of solu- 
tions in different values of h and get information on data. A 
criterion to determine the optimal value of h is, however, re- 
quired, when the method is used as a fully automatic tool for 
the prediction. Although there are a few possible approaches, 
their implementations and tests are left for future studies. 

Another important problem is the study of the multiple so- 
lutions in our iterative procedure. For example, when near 
loss of the linear independence among explanatory variables 
(colinearity) exists, the algorithm is likely to have multiple 
fixed points. Note that the existence of local optima is usu- 
ally regarded as a mere difficulty of algorithms in information 
processing. There seems, however, often a possibility of ex- 
ploration of the nature of the data from the emergence of 
multiple extremum. It is interesting if the proposed algo- 
rithm provides an example for such a study. For this pur- 
pose, it requires more investigation into the behavior of the 
algorithm in complex situations. 

We acknowledge Dr. Kabashima for fruitful discussions on the 
work. This work is supported by RWC Project of Ministry 
of International Trade and Industry. 
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